Batch Ensembl CDS, stabilize BioMart downloads, and improve PAE handling by St3451 · Pull Request #80 · bbglab/oncodrive3d

St3451 · 2026-02-16T15:22:30Z

Summary

This PR makes the build pipeline faster and more reliable under current Ensembl/BioMart/AF DB constraints. Ensembl CDS is now batched, BioMart downloads have fallbacks and clearer logs, and PAE handling is robust to missing AF DB versions with a new --custom_pae_dir. MANE builds are explicitly pinned to AF DB v4 for consistency, and several edge‑case crashes plus doc/debug ergonomics were cleaned up. Default AF DB version for non‑MANE builds is now v6 (latest at the time of this PR). Backtranseq batching/retry/timeouts were hardened to avoid multi‑hour hangs.

Issue 1: Ensembl CDS retrieval speed

What: Switched CDS retrieval to batched POST requests (50 IDs/request) with 429 handling.
Why: Per‑transcript GET became too slow and rate‑limited.
How:
- Batch ID POST to /sequence/id; retry on 429; keep result order.
- Batch requests are capped to 8 workers; missing IDs are retried individually in parallel (up to 8) after the batch pass.
Notes:
- Short CDS are now treated as missing (NA) to avoid invalid contexts.
- Batch retries are now 8 attempts (was 10 in the note).
- Failed batch IDs are retried one‑by‑one before being moved to non‑MANE.
- Added debug log for missing custom ENSP IDs not present in MANE summary.
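
The batching and 429 handling described above could look roughly like the sketch below. It is not the PR's code: `chunked`, `fetch_cds_batch`, and the injected `post` callable are hypothetical names, and the `type=cds` parameter and the `query`/`seq` response fields are assumptions taken from the public Ensembl REST documentation.

```python
import time

ENSEMBL_SEQ_URL = "https://rest.ensembl.org/sequence/id"
HEADERS = {"Content-Type": "application/json", "Accept": "application/json"}


def chunked(ids, size=50):
    """Split a list of IDs into consecutive batches of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def fetch_cds_batch(ids, post, max_attempts=8, sleep=time.sleep):
    """Fetch CDS for one batch and return sequences in the order of `ids`.

    `post` is any callable with the keyword interface of requests.post.
    IDs that the server does not return are reported as None.
    """
    for _ in range(max_attempts):
        resp = post(ENSEMBL_SEQ_URL, params={"type": "cds"},
                    json={"ids": list(ids)}, headers=HEADERS, timeout=60)
        if resp.status_code == 429:
            # Rate limited: honour Retry-After when present, then retry.
            sleep(float(resp.headers.get("Retry-After", 1)))
            continue
        if resp.status_code == 200:
            # Assumed response shape: a list of {"query": ..., "seq": ...}.
            by_id = {rec.get("query", rec.get("id")): rec.get("seq")
                     for rec in resp.json()}
            return [by_id.get(i) for i in ids]
        sleep(1)  # other errors: brief pause before the next attempt
    return [None] * len(ids)
```

In real use one would pass `requests.post` as `post` and collect the `None` entries for the one‑by‑one retry pass.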

Issue 2: BioMart metadata download instability

What: Added an archive‑to‑latest fallback and a Python downloader fallback; improved logs.
Why: Archive endpoints sometimes return 500/timeout.
How:
- Retry loops with stderr capture, start‑attempt logs, --no-hsts, and SSL verify for HTTPS fallback.
- Cap the number of download segments at 8 in download_single_file().
Notes: If downloads fail, canonical transcript prioritization is skipped (CDS is obtained from the Proteins API) and the build continues.
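
A minimal sketch of the retry‑then‑fallback pattern, assuming a caller‑supplied `fetch(url, dest)` that raises on failure. `download_with_fallback` and its parameters are hypothetical and do not reproduce download_biomart_metadata.

```python
import logging
import os
import time

logger = logging.getLogger(__name__)


def download_with_fallback(urls, dest, fetch, max_attempts=5, sleep=time.sleep):
    """Try each URL in order, retrying up to `max_attempts` times per URL.

    Returns the URL that succeeded, or None if every URL failed.
    """
    for url in urls:
        for attempt in range(1, max_attempts + 1):
            logger.info("Download attempt %d/%d from %s", attempt, max_attempts, url)
            try:
                fetch(url, dest)
                return url
            except Exception as exc:
                logger.warning("Attempt %d failed for %s: %s", attempt, url, exc)
                # Remove a partial file so the next attempt starts clean.
                if os.path.exists(dest):
                    os.remove(dest)
                sleep(attempt)
    return None
```

Listing the archive URL first and the latest‑release URL second gives the archive‑to‑latest order; a `None` result lets the caller skip canonical transcript prioritization instead of aborting.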

Issue 3: PAE availability and custom input

What: Added --custom_pae_dir; skip download when 10 consecutive 404/410s detected.
Why: PAE URLs for older AF versions are no longer hosted.
How: Copy provided directory into pae/; probe first 10 IDs sequentially, then parallel download.
Notes: If PAE is missing, pCMAPs fall back to binary contact maps.
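
The sequential probe could be sketched as follows. This is a simplified illustration: `pae_is_hosted` and `get_status` (a callable returning an HTTP status code for one ID) are hypothetical names, and treating any non‑404/410 answer as "hosted" is an assumption.

```python
def pae_is_hosted(ids, get_status, probe_n=10):
    """Probe the first `probe_n` IDs one at a time.

    Returns False only if every probed ID answers 404 or 410, which is taken
    as a sign that PAE files for this AF DB version are no longer hosted.
    """
    probe = ids[:probe_n]
    for structure_id in probe:
        if get_status(structure_id) not in (404, 410):
            # At least one file exists (or the error is transient),
            # so the parallel download is worth starting.
            return True
    return False
```

When the function returns False the download step is skipped and pCMAPs fall back to binary contact maps, as noted above.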

Issue 4: MANE + AF version consistency

What: Force af_version=4 when --mane/--mane_only is used.
Why: MANE structures are only available from AF DB v4.
How:
- Override version early with a warning and reuse across the build.
- AF fragment merge fix: merge_af_fragments now receives af_version (fixes v4 hardcode when default is v6).
Notes: The default --af_version for non‑MANE builds is now 6.
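
The early override can be illustrated with a small helper. `resolve_af_version` and the two constants are hypothetical names chosen for this sketch; only the values 4 and 6 come from the description above.

```python
import logging

logger = logging.getLogger(__name__)

MANE_AF_VERSION = 4      # MANE builds are pinned to AF DB v4
DEFAULT_AF_VERSION = 6   # default for non-MANE builds


def resolve_af_version(af_version=DEFAULT_AF_VERSION, mane=False, mane_only=False):
    """Return the AF DB version to use for the whole build."""
    if (mane or mane_only) and af_version != MANE_AF_VERSION:
        logger.warning("MANE build requested: overriding af_version %s with %s",
                       af_version, MANE_AF_VERSION)
        return MANE_AF_VERSION
    return af_version
```

Resolving the version once and passing the result on (for example to the fragment merge step) avoids a second hardcoded default elsewhere.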

Issue 5: Backtranseq robustness

What: Reduce batch size to 100, add non‑200 logging, add 45‑min total timeout with max 5 retries per batch.
Why: Large single batches were timing out/hanging with no exit path.
How: Cap batch size; add HTTP logging; retry with bounded time; return NaN on failure.
Notes: Failures now surface earlier but do not block the build.
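
One way to bound both the retry count and the total time is sketched below. `call_with_budget` is a hypothetical helper, the exponential backoff is an assumption, and the 45‑minute budget is applied per call here, which may differ from how the PR scopes it.

```python
import math
import time


def call_with_budget(call, max_retries=5, total_timeout=45 * 60,
                     clock=time.monotonic, sleep=time.sleep):
    """Run `call()` until it succeeds, retries run out, or time is up.

    Returns the result of `call()`, or NaN when no attempt succeeded.
    """
    deadline = clock() + total_timeout
    for attempt in range(1, max_retries + 1):
        if clock() >= deadline:
            break
        try:
            return call()
        except Exception:
            remaining = deadline - clock()
            if attempt == max_retries or remaining <= 0:
                break
            # Back off, but never sleep past the deadline.
            sleep(min(2 ** attempt, remaining))
    return math.nan
```

Returning NaN rather than raising keeps a single failed batch from blocking the rest of the build.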

Issue 6: Sequence/PDB hygiene & mapping

SEQRES guard: skip SEQRES insertion if already present (custom PDBs + fragment merge) and log skipped counts.
Custom PDB copy logging: info‑level summary of how many custom PDBs were copied (and skipped invalid filenames); debug count for SEQRES insertions.
Proteins API guard: skip ENSP IDs when querying Proteins API to go straight to Backtranseq.
Custom MANE symbol propagation: custom samplesheet symbol/gene now fills ENSP‑only entries; debug log warns if custom ENSPs aren’t in MANE summary.
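
The SEQRES guard amounts to checking for existing records before inserting. The sketch below works on lists of PDB lines; `has_seqres` and `add_seqres` are hypothetical names, and placing the records before the first coordinate record is an assumption about the insertion point.

```python
def has_seqres(pdb_lines):
    """True if the PDB text already contains at least one SEQRES record."""
    return any(line.startswith("SEQRES") for line in pdb_lines)


def add_seqres(pdb_lines, seqres_lines):
    """Return (new_lines, inserted); insertion is skipped if SEQRES exists."""
    if has_seqres(pdb_lines):
        return list(pdb_lines), False
    # Insert before the first coordinate record, else append at the end.
    for i, line in enumerate(pdb_lines):
        if line.startswith(("ATOM", "HETATM", "MODEL")):
            merged = list(pdb_lines[:i]) + list(seqres_lines) + list(pdb_lines[i:])
            return merged, True
    return list(pdb_lines) + list(seqres_lines), True
```

The boolean return value makes it easy to count insertions and skips for the log summary.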

Issue 7: Developer UX & tooling

Fix prepare_samplesheet.py so it runs directly from its folder (adds repo root to sys.path, avoiding ModuleNotFoundError: scripts).
Preprocessing tool behavior: update_samplesheet_and_structures.py always adds symbol to final bundles; samplesheet.csv is kept clean (no symbol/CGC/length). --include-metadata now only adds CGC/length.
Removed hardcoded local paths from build_datasets.py and kept CLI‑only guard.
Add empty‑df checks, NA guards, doc fixes, new launch configs.
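
For the prepare_samplesheet.py fix, adding the repository root to sys.path can be sketched like this. `add_repo_root_to_path` is a hypothetical helper, and `levels_up` is an assumption because the script's depth in the repository is not stated here.

```python
import sys
from pathlib import Path


def add_repo_root_to_path(script_path, levels_up=2):
    """Insert the directory `levels_up` levels above the script into sys.path."""
    repo_root = Path(script_path).resolve().parents[levels_up - 1]
    if str(repo_root) not in sys.path:
        sys.path.insert(0, str(repo_root))
    return repo_root
```

A script would call it with `__file__` before importing from the `scripts` package.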

Copilot

Pull request overview

This pull request enhances the robustness and monitoring of dataset building operations in scripts/datasets/seq_for_mut_prob.py. It introduces fallback mechanisms for BioMart metadata downloads and adds progress tracking for Ensembl CDS retrieval, improving reliability when external services are unavailable or slow.

Changes:

Refactored download_biomart_metadata to include retry logic, fallback from archive to latest Ensembl server, and Python-based downloader when wget is unavailable
Added progress monitoring with tqdm for Ensembl CDS sequence retrieval
Simplified multiprocessing by removing unnecessary wrapper function

scripts/datasets/seq_for_mut_prob.py

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

scripts/datasets/seq_for_mut_prob.py

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

scripts/datasets/seq_for_mut_prob.py:956

This branch is effectively unreachable, and sys.exit() is risky inside a library-style helper (and especially problematic under multiprocessing). r.raise_for_status() raises for 4xx/5xx responses, so status = "ERROR" and sys.exit() never run; if the logic changes later, sys.exit() could unexpectedly terminate the whole process. Consider removing this block, handling non-OK responses via exceptions/retries, and returning np.nan on terminal failure.

            if not r.ok:
                r.raise_for_status()

                status = "ERROR"
                sys.exit()

scripts/datasets/seq_for_mut_prob.py

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (2)

scripts/datasets/seq_for_mut_prob.py:1456

The main block (lines 1446-1456) contains hardcoded local paths for testing, which should not be in production code. While the PR description mentions removing hardcoded paths from build_datasets.py and adding a CLI-only guard, this file still has test code. Consider removing this test code or replacing it with a similar guard that directs users to use the CLI.

if __name__ == "__main__":
    raise SystemExit(
        "This module is intended to be used via the CLI: `oncodrive3d build-datasets`."
    )

scripts/datasets/seq_for_mut_prob.py:1237

The function signature changed to remove ens_canonical_transcripts_lst, custom_mane_metadata_path, and mane_version parameters, and add mane_only parameter. The docstring should be updated to reflect these changes and explain the new behavior, particularly around how mane_only affects the filtering of non-MANE sequences.

                        uniprot_to_gene_dict,
                        mane_mapping,
                        mane_mapping_not_af,
                        mane_only=False,
                        num_cores=1):
    """
    Retrieve DNA sequence and tri-nucleotide context
    for each structure in the initialized dataframe
    prioritizing MANE associated structures and metadata.

    Reference_info labels:
        1  : Transcript ID, exons coord, seq DNA obtained from Proteins API
        0  : Transcript ID retrieved from MANE and seq DNA from Ensembl
        -1 : Not available transcripts, seq DNA retrieved from Backtranseq API
    """

scripts/datasets/seq_for_mut_prob.py

scripts/datasets/build_datasets.py

scripts/main.py

scripts/datasets/build_datasets.py

scripts/datasets/get_pae.py

scripts/datasets/seq_for_mut_prob.py

scripts/datasets/get_pae.py

scripts/datasets/seq_for_mut_prob.py

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

scripts/datasets/get_pae.py

scripts/datasets/build_datasets.py

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

tools/preprocessing/update_samplesheet_and_structures.py

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

scripts/datasets/get_pae.py:63

If the response is HTTP 200 but the body doesn’t match the expected JSON pattern (content.endswith(b'}]')), status remains INIT, which causes the loop to immediately retry without the 30s backoff. Set status = "ERROR" (or otherwise sleep) when content validation fails to avoid tight retry loops and rate-limiting.

            content = response.content
            if content.endswith(b'}]') and not content.endswith(b'</Error>'):
                with open(file_path, 'wb') as output_file:
                    output_file.write(content)
                status = "FINISHED"
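
A minimal sketch of the suggested fix, assuming a caller‑supplied `fetch()` that returns the response body as bytes. `download_pae`, `max_attempts`, and `backoff` are hypothetical names; only the content check and the 30 s wait come from the comment above.

```python
import time


def download_pae(fetch, file_path, max_attempts=5, backoff=30, sleep=time.sleep):
    """Download one PAE file; back off after any failed or invalid response."""
    for _ in range(max_attempts):
        status = "INIT"
        try:
            content = fetch()
            if content.endswith(b"}]") and not content.endswith(b"</Error>"):
                with open(file_path, "wb") as output_file:
                    output_file.write(content)
                status = "FINISHED"
            else:
                # HTTP 200 but unexpected body: treat as an error so the
                # loop waits instead of retrying immediately.
                status = "ERROR"
        except Exception:
            status = "ERROR"
        if status == "FINISHED":
            return True
        sleep(backoff)
    return False
```

Every path that does not finish now passes through the same wait, so an invalid body can no longer trigger a tight retry loop.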

Copilot · 2026-02-19T17:31:14Z

scripts/datasets/af_merge.py

@@ -224,12 +224,17 @@ def get_pdb_seqres_records(lst_res):
 def add_refseq_record_to_pdb(path_structure):
    """
    Add the SEQREF records to the pdb file.


The docstring says “Add the SEQREF records”, but the code is inserting SEQRES records. Update the docstring to match the PDB record type to avoid confusion.

Suggested change

-    Add the SEQREF records to the pdb file.
+    Add the SEQRES records to the pdb file.

Copilot · 2026-02-19T17:31:14Z

tools/preprocessing/update_samplesheet_and_structures.py

-        samplesheet = attach_metadata(samplesheet, metadata_map)
-        samplesheet.to_csv(paths.samplesheet_path, index=False)
+
+    metadata_for_outputs = metadata_map or symbol_map


metadata_for_outputs = metadata_map or symbol_map will raise ValueError: The truth value of a DataFrame is ambiguous when metadata_map is a DataFrame. Use an explicit is not None check (e.g., choose metadata_map if it’s not None, otherwise symbol_map).

Suggested change

-    metadata_for_outputs = metadata_map or symbol_map
+    metadata_for_outputs = metadata_map if metadata_map is not None else symbol_map
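
The difference between `or` and an explicit `is not None` check can be shown without pandas. In this sketch `AmbiguousTable` is a hypothetical stand‑in that mimics how a DataFrame raises ValueError when truth‑tested, and `pick_metadata` is an illustrative name.

```python
class AmbiguousTable:
    """Stand-in for a pandas DataFrame: truth-testing raises ValueError."""

    def __init__(self, rows):
        self.rows = rows

    def __bool__(self):
        raise ValueError("The truth value of a DataFrame is ambiguous.")


def pick_metadata(metadata_map, symbol_map):
    """Prefer metadata_map when it was built, without truth-testing it."""
    return metadata_map if metadata_map is not None else symbol_map
```

`metadata_map or symbol_map` calls `__bool__` on the left operand and therefore fails for table‑like objects, while the identity check never does.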

tools/preprocessing/update_samplesheet_and_structures.py

St3451 added 3 commits February 16, 2026 12:45

fix: enhance download_biomart_metadata function with retry logic and fallback URL

31bac06

feat: improve get_ref_dna_from_ensembl_mp function with error handling and optimized processing

b268d2a

refactor: streamline get_ref_dna_from_ensembl function by removing unnecessary wrapper

86796c0

Copilot AI review requested due to automatic review settings February 16, 2026 15:22

St3451 changed the title ~~Dev/fix build datasets~~ Fix issue in build-datasets Feb 16, 2026

Copilot started reviewing on behalf of St3451 February 16, 2026 15:22

Copilot AI reviewed Feb 16, 2026

St3451 added 9 commits February 16, 2026 17:35

logs: enhance get_ref_dna_from_ensembl function with improved error handling and logging

fea7d2f

docs: update organism specification in README to include full scientific names

a402cde

refactor: simplify multiprocessing in get_ref_dna_from_ensembl_mp function

d961730

fix: enhance download_biomart_metadata function with improved error handling and fallback mechanism

8c2ce07

logs: improve error handling and logging in download_biomart_metadata function

dbf0cd2

fix: improve error handling by removing partial BioMart metadata files before fallback

519d92c

feat: add batch retrieval for Ensembl CDS DNA sequences and improve request handling

2db7336

feat: handle rate limiting in get_ref_dna_from_ensembl_batch function

3753b2c

increase max_attempts in download_biomart_metadata to 5

826c239

St3451 requested a review from Copilot February 17, 2026 00:50

Copilot started reviewing on behalf of St3451 February 17, 2026 00:50

St3451 changed the title ~~Fix issue in build-datasets~~ Speed up Ensembl CDS retrieval with batched REST requests AND add BioMart download fallbacks Feb 17, 2026

St3451 added 2 commits February 17, 2026 01:55

fix: add logging import to seq_for_mut_prob.py

2cf9865

remove outdated download_biomart_metadata function

2c006e3

Copilot AI reviewed Feb 17, 2026

fix: handle empty dataframes in process_seq_df and process_seq_df_mane

08eb1bb

St3451 requested a review from Copilot February 17, 2026 10:29

Copilot started reviewing on behalf of St3451 February 17, 2026 10:30

Copilot AI reviewed Feb 17, 2026

scripts/datasets/seq_for_mut_prob.py (resolved)

scripts/datasets/seq_for_mut_prob.py (outdated, resolved)

scripts/datasets/seq_for_mut_prob.py (resolved)

St3451 added 4 commits February 17, 2026 11:51

logs: enhance logging for BioMart download failures and improve error handling

40b3432

logs: add warning log for exceeding max attempts in Ensembl CDS batch requests

62e8df5

fix: add SSL verification for download_single_file in download_biomart_metadata

a29831d

logs: update headers for Ensembl REST API and add wget option to disable HSTS

5c2a26f

fix: remove existing PAE output directory before copying custom PAE directory

94f8f2e

Copilot AI reviewed Feb 18, 2026

St3451 added 4 commits February 18, 2026 12:14

docs: update process_seq_df docstring to include canonical transcript metadata retrieval

91e22f4

feat: add function to load custom gene symbol mappings from samplesheet

eaa0b9e

frefactor: update samplesheet tool build_metadata_map() to accept a precomputed symbol map and only add CGC + length on top

f7853af

logs: enhance PDB copying process with detailed logging and summary of operations

d8ea655

St3451 requested a review from Copilot February 18, 2026 12:20

Copilot started reviewing on behalf of St3451 February 18, 2026 12:20

Copilot AI reviewed Feb 18, 2026

scripts/datasets/get_pae.py (outdated, resolved)

scripts/datasets/get_pae.py (resolved)

scripts/datasets/build_datasets.py (resolved)

St3451 added 6 commits February 18, 2026 13:43

fix: ensure REPO_ROOT is added to sys.path for module imports in prepare_samplesheet.py

3c3f055

limit number of connections in download_single_file to a maximum of 10

50498ac

reduce maximum number of connections in download_single_file from 10 to 8

0ed4e61

feat: add retry logic for missing entries; increase max attempts and wait time in get_ref_dna_from_ensembl_batch;

7967d1b

cap Ensembl CDS batch workers to a maximum number of cores

3a348b5

feat: implement bounded parallelism for retrying missing Ensembl CDS entries

2c5c109

St3451 requested a review from Copilot February 19, 2026 09:47

Copilot started reviewing on behalf of St3451 February 19, 2026 09:48

fix: handle consecutive missing PAE downloads correctly

ea63130

Copilot AI reviewed Feb 19, 2026

tools/preprocessing/update_samplesheet_and_structures.py (resolved)

tools/preprocessing/update_samplesheet_and_structures.py (outdated, resolved)

St3451 added 3 commits February 19, 2026 17:32

logs: enhance logging for Ensembl CDS retrieval with sequence count

cdf9f6a

fix: prevent duplicate SEQRES records in PDB files and log skipped insertions

2209b0e

fix: handle ENSP IDs in get_exons_coord function and return NaN for missing data

d7bf403

St3451 requested a review from Copilot February 19, 2026 17:24

Copilot started reviewing on behalf of St3451 February 19, 2026 17:25

update symbol assignment to use pd.NA for missing values in build_symbol_map function

6eeadf3

Copilot AI reviewed Feb 19, 2026

St3451 added 4 commits February 19, 2026 23:01

fix: update add_seqres_to_pdb to return bool and skip insertion if SEQRES already exists

9170592

fix: reduce batch size for Backtranseq API calls to improve performance and reliability

3d8d4e9

logs: enhance error handling and logging in backtranseq function for API responses

44255a5

feat: enhance backtranseq function with retry logic and timeout handling for improved reliability

b836ccf
